Abstract

This paper affirms that quantification of life-critical 
software reliability is infeasible using statistical meth- 
ods whether applied to standard software or fault- 
tolerant software. The key assumption of software 
fault tolerance — separately programmed versions fail 
independently — is shown to be problematic. This as- 
sumption cannot be justified by experimentation in the 
ultrareliability region and subjective arguments in its 
favor are not sufficiently strong to justify it as an ax- 
iom. Also, the implications of the recent multiversion 
software experiments support this affirmation. 
1 Introduction

The potential of enhanced flexibility and functionality 
has led to an ever increasing use of digital computer 
systems in control applications. At first, the digital 
systems were designed to perform the same functions 
as their analog counterparts. However, the availabil- 
ity of enormous computing power at a low cost has led 
to expanded use of digital computers in current applica- 
tions and their introduction into many new applications. 
Thus, larger and more complex systems are being designed. The result has been, as promised, increased performance at a minimal hardware cost; however, it has
also resulted in software systems which contain more 
errors. Sometimes, the impact of a software bug is 
nothing more than an inconvenience. At other times 
a software bug leads to costly downtime. But what will 
be the impact of design flaws in software systems used 
in life-critical applications such as industrial-plant con- 
trol, aircraft control, nuclear-reactor control, or nuclear- 
warhead arming? What will be the price of software 
failure as digital computers are applied more and more 
frequently to these and other life-critical functions? Al- 
ready, the symptoms of using insufficiently reliable soft- 
ware for life-critical applications are appearing [7, 14, 3]. 

For many years, much research has focused on the 
quantification of software reliability. Research efforts 
started with reliability growth models in the early 
1970s. In recent years, an emphasis on developing
methods which enable reliability quantification of soft- 
ware used for life-critical functions has emerged. The 
common approach which is offered is the combination 
of software fault-tolerance and statistical models. 

In this paper, we will investigate the software reliability problem from two perspectives. We will first explore the problems which arise when software is tested as a black box, i.e., subjected to inputs and its outputs checked without examination of internal structure. Then,
we will examine the problems which arise when software 
is not treated as a black box, i.e. some internal structure 
is modeled. In either case, we argue that the associated 
problems are intractable — i.e., they inevitably lead to a 
need for testing beyond what is practical. 
2 Software Reliability

For life-critical applications, the validation process must 
establish that system reliability is extremely high. His- 
torically, this ultrahigh reliability requirement has been 
translated into a probability of failure on the order of 
$10^{-7}$ to $10^{-9}$ for 1 to 10 hour missions. Unfortunately,
such probabilities create enormous problems for valida- 
tion. For convenience, we will use the following termi- 
nology: 


name                    failure rate (per hour)
ultrareliability        < $10^{-7}$
moderate reliability    $10^{-3}$ to $10^{-6}$
low reliability         > $10^{-3}$


Software does not physically fail as hardware does. 
Physical failures (as opposed to hardware design flaws) 
occur when hardware wears out, breaks, or is adversely 
affected by environmental phenomena such as electromagnetic fields or alpha particles. Software is not subject to these problems. Software faults are present at
the beginning of and throughout a system’s lifetime. 
In this sense, software reliability is meaningless: software is either correct or incorrect with respect to
its specification. Nevertheless, software systems are em- 
bedded in stochastic environments. These environments 
subject the software program to a sequence of inputs 
over time. For each input, the program produces either 
a correct or an incorrect answer. Thus, in a systems con- 
text, the software system produces errors in a stochastic 
manner; the sequence of errors behaves like a stochastic 
point process. 

In this paper, the inherent difficulty of accurately 
modeling software reliability will be explored. To fa- 
cilitate the discussion, we will construct a simple model 
of the software failure process. The driver of the failure 
process is the external system that supplies inputs to 
the program. As a function of its inputs and internal 
state, the program produces an output. If the software 
were perfect, the internal state would be correct and the 
outputs produced would be correct. However, if there is 
a design flaw in the program, it can manifest itself either 
by production of an erroneous output or by corruption 
of the internal state (which may affect subsequent out- 
puts). 

In a real-time system, the software is periodically 
scheduled, i.e. the same program is repeatedly executed 
in response to inputs. It is not unusual to find “iteration 
rates” of 10 to 100 cycles per second. If the probability 
of software failure per input is constant, say p, we have 
a binomial process. The number of failures $S_n$ after $n$ inputs is given by the binomial distribution:

$$P(S_n = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$


We wish to compute the probability of system failure for $n$ inputs. System failure occurs for all $S_n > 0$. Thus,

$$P_{sys}(n) = P(S_n > 0) = 1 - P(S_n = 0) = 1 - (1 - p)^n$$

This can be converted to a function of time with the transformation $n = Kt$ where $K$ = the number of inputs per unit time. The system failure probability at time $t$, $P_{sys}(t)$, is thus:

$$P_{sys}(t) = 1 - (1 - p)^{Kt} \qquad (1)$$

Of course, this calculation assumes that the probability 
of failure per input is constant over time.¹

This binomial process can be accurately approxi- 
mated by an exponential distribution since p is small 
and n is large: 

$$P_{sys}(t) = 1 - e^{-Ktp}$$

This is easily derived using the Poisson approximation 
to the binomial. The discrete binomial process can thus 
be accurately modeled by a continuous exponential pro- 
cess. In the following discussion, we will frequently use 
the exponential process rather than the binomial pro- 
cess to simplify the discussion. 
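
The closeness of the two forms can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names and the particular values of $K$, $t$, and $p$ are assumptions chosen only for illustration.

```python
import math

def p_sys_binomial(p: float, k_per_hour: float, t_hours: float) -> float:
    """Equation (1): 1 - (1 - p)^(K t), evaluated stably for very small p."""
    n = k_per_hour * t_hours
    return -math.expm1(n * math.log1p(-p))

def p_sys_exponential(p: float, k_per_hour: float, t_hours: float) -> float:
    """Exponential approximation: 1 - exp(-K t p)."""
    return -math.expm1(-k_per_hour * t_hours * p)

# Assumed illustrative values: 10 cycles per second, 10 hour mission.
K = 10 * 3600
T = 10.0
p = 1e-12

exact = p_sys_binomial(p, K, T)
approx = p_sys_exponential(p, K, T)
rel_diff = abs(exact - approx) / exact
print(exact, approx, rel_diff)
```

`math.expm1` and `math.log1p` are used because a direct evaluation of $1 - (1 - p)^{Kt}$ loses precision when $p$ is this small.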

3 Analyzing Software as a Black Box

The traditional method of validating reliability is life 
testing. In life testing, a set of test specimens are oper- 
ated under actual operating conditions for a predeter- 
mined amount of time. Over this period, failure times 
are recorded and subsequently used in reliability com- 
putation. The internal structure of the test specimens 
is not examined. The only observable is whether a spec- 
imen has failed or not. 

¹ If the probability of failure per input were not constant, then the reliability analysis problem is even harder. One would have to estimate $p(t)$ rather than just $p$. A time-variant system would require even more testing than a time-invariant one, since the rate must be determined as a function of mission time. The system would have to be placed in a random state corresponding to a specific mission time and subjected to random inputs. This would have to be done for each time point of interest within the mission time. Thus, if the reliability analysis is intractable for systems with constant $p$, it is unrealistic to expect it to be tractable for systems with non-constant $p(t)$.

For systems that are designed to attain a probability of failure on the order of $10^{-7}$ to $10^{-9}$ for 1 hour missions or longer, life testing is prohibitively impractical. This can be shown by an illustrative example. For simplicity, we will assume that the time to failure distribution is exponential.² Using standard statistical methods [11], the time on test can be estimated for
a specified system reliability. There are two basic ap- 
proaches: (1) testing with replacement and (2) testing 
without replacement. In either case, one places n items 
on test. The test is finished when r failures have been 
observed. In the first case, when a device fails a new 
device is put on test in its place. In the second case, a 
failed device is not replaced. The tester chooses values 
of $n$ and $r$ to obtain the desired levels of the $\alpha$ and $\beta$ errors (i.e., the probability of rejecting a good system and the probability of accepting a bad system, respectively). In general, the larger $r$ and $n$ are, the smaller the statistical estimation errors are. The expected time on test can be calculated as a function of $r$ and $n$. The expected time on test, $A$, for the replacement case is:

$$A = \mu_0 \frac{r}{n} \qquad (2)$$

where $\mu_0$ is the mean failure time of the test specimen [11]. The expected time on test for the non-replacement case is:

$$A = \mu_0 \sum_{j=1}^{r} \frac{1}{n - j + 1} \qquad (3)$$


Even without specifying an $\alpha$ or $\beta$ error, a good indication of the testing time can be determined. Clearly, the number of observed failures $r$ must be greater than 0 and the total number of test specimens $n$ must be greater than or equal to $r$. For example, suppose the system has a probability of failure of $10^{-9}$ for a 10 hour mission. Then the mean time to failure of the system (assuming exponentially distributed) $\mu_0$ is:

$$\mu_0 = \frac{10}{-\ln[1 - 10^{-9}]} \approx 10^{10}$$

Table 1 shows the expected test duration for this system as a function of the number of test replicates $n$ for $r = 1$.³ It should be noted that a value of $r$ equal to 1 produces the shortest test time possible but at the price of extremely large $\alpha$ and $\beta$ errors. To get satisfactory
statistical significance, larger values of r are needed and 
consequently even more testing. Therefore, given that 
the economics of testing fault-tolerant systems (which 
are very expensive) rarely allow n to be greater than 
10, life-testing is clearly out of the question for ultra- 
reliable systems. The technique of statistical life-testing 
is discussed in more detail in the appendix. 

² In the previous section the exponential process was shown to be an accurate approximation to the discrete binomial software failure process.

³ The expected time with or without replacement is almost the same in this case.


no. of replicates (n)    Expected Test Duration A
1                        $10^{10}$ hours = 1141550 years
10                       $10^{9}$ hours = 114155 years
100                      $10^{8}$ hours = 11415 years
10000                    $10^{6}$ hours = 114 years

Table 1: Expected Test Duration For r=1
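
The entries of Table 1 follow directly from equations (2) and (3). The sketch below recomputes them; the function names are hypothetical, and the conversion of 8760 hours per year is an assumption that appears to match the table's figures.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumed conversion; it reproduces the table's years

def mean_time_to_failure(p_fail: float, mission_hours: float) -> float:
    """mu_0 for an exponential law with P(failure by end of mission) = p_fail."""
    return mission_hours / -math.log1p(-p_fail)

def expected_test_hours(mu0: float, r: int, n: int, replacement: bool = True) -> float:
    """Equations (2) and (3): expected time until the r-th failure among n units."""
    if not 0 < r <= n:
        raise ValueError("need 0 < r <= n")
    if replacement:
        return mu0 * r / n
    return mu0 * sum(1.0 / (n - j + 1) for j in range(1, r + 1))

mu0 = mean_time_to_failure(1e-9, 10.0)
for n in (1, 10, 100, 10000):
    hours = expected_test_hours(mu0, r=1, n=n)
    print(f"n={n:>5}  {hours:.3e} hours  ~ {hours / HOURS_PER_YEAR:,.0f} years")
```

For $r = 1$ the two equations coincide, so the replacement flag only matters when more than one failure is to be observed.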


4 Reliability Growth Models 

The software design process involves a repetitive cycle 
of testing and repairing a program. A program is sub- 
jected to inputs until it fails. The cause of failure is 
determined; the program is repaired and is then subjected to a new sequence of inputs. The result is a sequence of programs $p_1, p_2, \ldots, p_n$ and a sequence of interfailure times $T_1, T_2, \ldots, T_n$ (usually measured in number of inputs). The goal is to construct a mathematical technique (i.e., model) to predict the reliability of the final program $p_n$ based on the observed interfailure data.
Such a model enables one to estimate the probability 
of failure of the final “corrected” program without sub- 
jecting it to a sequence of inputs. This process is a form 
of prediction or extrapolation and has been studied in 
detail [1, 10, 8]. These models are called “Reliability 
Growth Models”. If one resists the temptation to cor- 
rect the program based on the last failure, the method 
is equivalent to black-box testing the final version. If 
one corrects the final version and estimates the reliabil- 
ity of the corrected version based on a reliability growth 
model, one hopefully has increased the efficiency of the 
testing process in doing so. The question we would like 
to examine is how much efficiency is gained by use of a 
reliability growth model and is it enough to get us into 
the ultrareliable region. Unfortunately, the answer is 
that the gain in efficiency is not anywhere near enough 
to get us into the ultrareliable region. This has been 
pointed out by several authors. Keiller and Miller write 
[4]: 


The reliability growth scenario would start 
with faulty software. Through execution of 
the software, bugs are discovered. The soft- 
ware is then modified to correct for the de- 
sign flaws represented by the bugs. Gradu- 
ally the software evolves into a state of higher 
reliability. There are at least two general rea- 
sons why this is an unreasonable approach to 
highly-reliable safety-critical software. The
time required for reliability to grow to ac- 
ceptable levels will tend to be extremely long. 
Extremely high levels of reliability cannot be 
guaranteed a priori.

Littlewood writes [9]: 

Clearly, the reliability growth techniques of 
§2 [a survey of the leading reliability growth
models] are useless in the face of such ultra- 
high reliability requirements. It is easy to see 
that, even in the unlikely event that the sys- 
tem had achieved such a reliability, we could 
not assure ourselves of that achievement in an 
acceptable time. 

The problem alluded to by these authors can be seen 
clearly by applying a reliability growth model to exper- 
imental data. The data of table 2 was taken from an 
experiment performed by Nagel and Skrivan [13].

version    failure probability per input
1          0.9803
2          0.1068
3          0.002602
4          0.002104
5          0.001176
6          0.0007659

Table 2: Nagel Data From Program A1

The data in this table was obtained for program A1, one of
six programs investigated. The versions represent the 
successive stages of the program as bugs were removed. 
A log-linear growth model was postulated and found to 
fit all 6 programs analyzed in the report. Even if the 
log-linear model was found to apply to the ultrareliable 
region, it would not alleviate the problem. Let’s suppose 
the debugging process was continued on the above pro- 
gram until it reached ultrareliability. From the growth 
model we can estimate how many bugs would have to 
be removed from this program in order to get it to the 
ultrareliable region. A simple regression of the natural logarithm of the failure probability on the version number, using the data of Table 2, yields a slope and y-intercept of $-1.415$ and $0.2358$, respectively. The correlation coefficient is $-0.913$. Even if the system were extremely slow, say 1 input processed per minute, the failure rate per input must be less than $10^{-9}/60 = 1.67 \times 10^{-11}$ in order for the program to have a failure rate of $10^{-9}$/hour. Using the regression results, it can be seen that approximately 17 bugs must be removed:


bug    failure rate per input
16     $1.87332 \times 10^{-10}$
17     $4.55249 \times 10^{-11}$
18     $1.10633 \times 10^{-11}$


Thus one could test until 17 bugs have been removed, remove the last bug and use the reliability growth model to predict a failure rate per input of $1.106 \times 10^{-11}$. But how long would it take to remove the 17 bugs? The removal of the last bug alone would on average require approximately $2.2 \times 10^{10}$ test cases. Even if the testing process were 1000 times faster than the operational time per input (i.e., one input tested in 60/1000 secs),⁴ this would require 42 years of testing. Thus, we see why Littlewood, Keiller and Miller see little hope of using reliability growth models for ultrareliable software. This problem is not restricted to the program above but is universal. Table 3 repeats the above calculations for the rest of the programs in reference [13].⁵ At even the most optimistic improvement rates, it is obvious that reliability growth models are impractical for ultrareliable software. If the number of inputs per hour is increased to the more typical values of 10,000 to 100,000, then the above test times reach into the tens and hundreds of thousands of years.

program    slope       y-intercept    last bug    test time
A1         -1.415      0.2358         17          42 years
B1         -1.3358     1.1049         19          66 years
A2         -1.998      2.4572         13          31 years
B2         -3.623      2.3296         7           19 years
A3         -0.54526    -1.3735        42          66 years
B3         -1.3138     0.0912         19          32 years

Table 3: Test Time To Remove the Last Bug to Obtain Ultrareliability
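
The regression and the resulting test-time estimate for program A1 can be reproduced with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: it fits the natural logarithm of the Table 2 failure probabilities against the version number by ordinary least squares, and all function and variable names are hypothetical.

```python
import math

# Failure probability per input for successive versions of program A1 (Table 2).
failure_prob = [0.9803, 0.1068, 0.002602, 0.002104, 0.001176, 0.0007659]

def fit_log_linear(probs):
    """Least-squares fit of ln(p) = intercept + slope * version (versions 1..m)."""
    xs = list(range(1, len(probs) + 1))
    ys = [math.log(p) for p in probs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    corr = sxy / math.sqrt(sxx * syy)
    return slope, intercept, corr

slope, intercept, corr = fit_log_linear(failure_prob)

def predicted_rate(version: int) -> float:
    """Failure probability per input extrapolated from the fitted model."""
    return math.exp(intercept + slope * version)

# Per-input rate needed for 1e-9 failures/hour at one input per minute.
target = 1e-9 / 60
# Last version that still fails the target; removing its bug reaches the goal.
last_bug = next(v for v in range(1, 1000) if predicted_rate(v + 1) < target)

SECONDS_PER_TEST = 60 / 1000      # testing assumed 1000x faster than operation
SECONDS_PER_YEAR = 365 * 24 * 3600
expected_tests = 1 / predicted_rate(last_bug)
test_years = expected_tests * SECONDS_PER_TEST / SECONDS_PER_YEAR
print(slope, intercept, corr, last_bug, test_years)
```

The expected number of test cases is taken as the reciprocal of the predicted failure probability of the version that still contains the last bug, which is the reasoning used in the text above.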

⁴ This is very optimistic since one must not only run the test cases but also determine whether the answer is correct or not.

⁵ Table 3 assumes a perfect fit with the log-linear model in the ultrareliable region.

5 Software Fault Tolerance

Since fault tolerance has been successfully used to protect against hardware physical failures, it seems natural to apply the same strategy against software bugs. It is easy to construct a reliability model of a system designed to mask physical failures using redundant hardware and voting. The key assumption which enables both the design of ultrareliable systems from less reliable components and the estimation of $10^{-9}$ probabilities of failure is that the separate redundant components fail independently or nearly so. The independence assumption has been used in hardware fault tolerance modelling for many years. If the redundant components are
located in separate chassis, powered by separate power 
supplies, electrically isolated from each other and suf- 
ficiently shielded from the environment it is not unrea- 
sonable to assume failure independence of physical hard- 
ware faults. 

The basic strategy of the software fault-tolerance ap- 
proach is to design several versions of a program from 
the same specification and to employ a voter of some 
kind to protect the system from bugs. The voter can be an acceptance test (i.e., recovery blocks) or a comparator (i.e., N-version programming). Each version is programmed by a separate programming team.⁶ Since
the versions are developed by separate programming 
teams, it is hoped that the redundant programs will 
fail independently or nearly so [2, 15]. From the ver- 
sion reliability estimates and the independence assump- 
tion, system reliability estimates could be calculated. 
However, unlike hardware physical failures which are 
governed by the laws of physics, programming errors 
are the products of human reasoning (i.e., actually im- 
proper reasoning). The question thus becomes one of 
the reasonableness of assuming independence based on 
little or no practical or theoretical foundations. Subjec- 
tive arguments have been offered on both sides of this 
question. Unfortunately, the subjective arguments for 
multiple versions being independent are not compelling 
enough to qualify it as an axiom. The reasons why 
experimental justification of independence is infeasible 
and why ultrareliable quantification is infeasible despite 
software fault tolerance are discussed in the next sec- 
tion. 

5.1 Models of Software Fault Tolerance 

Many reliability models of fault-tolerant software have 
been developed based on the independence assumption. 
To accept such a model, this assumption must be ac- 
cepted. In this section, it will be shown how the inde- 
pendence assumption enables quantification in the ul- 
trareliable region, why quantification of fault-tolerant 
software reliability is unlikely without the independence 
assumption, and why this assumption cannot be exper- 
imentally justified for the ultrareliable region. 

⁶ Often these separate programming teams are called “independent programming” teams. The phrase “independent programming” does not mean the same thing as “independent manifestation of errors.”

5.1.1 Independence enables quantification of ultrareliability

The following example will show how independence enables ultrareliability quantification. Suppose three different versions of a program control a life-critical system using some software fault tolerance scheme. Let $E_{i,k}$ be the event that the $i$th version fails on its $k$th execution. Suppose the probability that version $i$ fails during the $k$th execution is $P_{i,k}$. As discussed in section 2, we will assume that the failure rate is constant. Since the versions are voted, the system does not fail unless there is a coincident error, i.e., two or more versions produce erroneous outputs in response to the same input. The probability that two or more versions fail on the $k$th execution causing system failure is:

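$$P_{sys,k} = P\big((E_{1,k} \wedge E_{2,k}) \text{ or } (E_{1,k} \wedge E_{3,k}) \text{ or } (E_{2,k} \wedge E_{3,k}) \text{ or } (E_{1,k} \wedge E_{2,k} \wedge E_{3,k})\big) \qquad (4)$$

Using the additive law of probability, this can be written as:

$$\begin{aligned}
P_{sys,k} = {} & P(E_{1,k} \wedge E_{2,k}) + P(E_{1,k} \wedge E_{3,k}) \\
& + P(E_{2,k} \wedge E_{3,k}) \\
& - 2P(E_{1,k} \wedge E_{2,k} \wedge E_{3,k})
\end{aligned} \qquad (5)$$

If independence of the versions is assumed, this can be rewritten as:

$$\begin{aligned}
P_{sys,k} = {} & P(E_{1,k})P(E_{2,k}) + P(E_{1,k})P(E_{3,k}) \\
& + P(E_{2,k})P(E_{3,k}) \\
& - 2P(E_{1,k})P(E_{2,k})P(E_{3,k})
\end{aligned} \qquad (6)$$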

The reason why independence is usually assumed is obvious from the above formula: if each $P(E_{i,k})$ can be estimated to be approximately $10^{-6}$, then the probability of system failure due to two or more coincident failures is approximately $3 \times 10^{-12}$.

Equation (6) can be used to calculate the probability of failure for a $T$ hour mission. Suppose $P(E_{i,k}) = p$ for all $i$ and $k$. Then

$$P_{sys,k} = 3p^2 - 2p^3 \approx 3p^2$$

and the probability that the system fails during a mission of $T$ hours can be calculated using equation (1):

$$P_{sys}(T) = 1 - (1 - P_{sys,k})^{KT} \approx 1 - (1 - 3p^2)^{KT}$$

where $K$ = the number of executions of the program in an hour. For small $p$ the following approximation is accurate:

$$P_{sys}(T) \approx 1 - e^{-3p^2KT} \approx 3p^2KT$$

For the following typical values of $T = 1$ and $K = 3600$ (i.e., 1 execution per second), we have

$$P_{sys}(T) \approx 3p^2KT = 3(10^{-6})(10^{-6})(3600) = 1.08 \times 10^{-8}$$
Thus, an ultrareliability quantification has been made. But this depended critically on the independence assumption. If the different versions do not fail independently, then equation (4) must be used to compute failure probabilities and the above calculation is meaningless. In fact, the probability of failure could be anywhere from 0 to about $10^{-2}$ (i.e., 0 to $3pKT$⁷).
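
The gap between the two ends of this range can be made concrete with the example values used above. The following sketch is illustrative only; the function names are hypothetical, and the upper end uses the first-order bound $3pKT$ quoted in the text.

```python
import math

def p_sys_per_execution(p: float) -> float:
    """Equation (6) with equal version failure probabilities: 3p^2 - 2p^3."""
    return 3 * p**2 - 2 * p**3

def mission_failure(p_exec: float, k_per_hour: float, t_hours: float) -> float:
    """Equation (1) applied to a per-execution system failure probability."""
    return -math.expm1(k_per_hour * t_hours * math.log1p(-p_exec))

p, K, T = 1e-6, 3600, 1.0

# Mission failure probability if the three versions fail independently.
independent = mission_failure(p_sys_per_execution(p), K, T)
# First-order bound if the system fails whenever any one version fails.
fully_dependent_bound = 3 * p * K * T
print(independent, fully_dependent_bound)
```

The two figures differ by six orders of magnitude, which is the whole of the benefit attributed to fault tolerance in this example.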


5.1.2 Ultrareliable quantification is infeasible without independence

Now consider the impact of not being able to assume in- 
dependence. The following argument was adapted from 
Miller [12]. To simplify the notation, the last subscript 
will be dropped when referring to the $k$th execution only. Thus,

$$P_{sys} = P(E_1 \wedge E_2) + P(E_1 \wedge E_3) + P(E_2 \wedge E_3) - 2P(E_1 \wedge E_2 \wedge E_3) \qquad (7)$$

Using the identity $P(A \wedge B) = P(A)P(B) + [P(A \wedge B) - P(A)P(B)]$, this can be rewritten as:

$$\begin{aligned}
P_{sys} = {} & P(E_1)P(E_2) + P(E_1)P(E_3) + P(E_2)P(E_3) \\
& - 2P(E_1)P(E_2)P(E_3) \\
& + [P(E_2 \wedge E_1) - P(E_1)P(E_2)] \\
& + [P(E_3 \wedge E_1) - P(E_3)P(E_1)] \\
& + [P(E_3 \wedge E_2) - P(E_3)P(E_2)] \\
& - 2[P(E_1 \wedge E_2 \wedge E_3) - P(E_1)P(E_2)P(E_3)]
\end{aligned} \qquad (8)$$

This rewrite of the formula reveals two components of 
the system failure probability: (1) the first two lines 
of equation 8 and (2) the last 4 lines of equation 8. 
If the multiple versions manifest errors independently, 
then the last four lines (i.e. the second component) will 
be equal to zero. Consequently, to establish indepen- 
dence experimentally, these terms must be shown to be 
0. Realistically, to establish “adequate” independence, 
these terms must be shown to have negligible effect on 
the probability of system failure. Thus, the first component represents the “non-correlated” contribution to $P_{sys}$ and the second component represents the “correlated” contribution to $P_{sys}$. Note that the terms in the first component of $P_{sys}$ are all products of the individual version probabilities.
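
To illustrate how the two components of equation (8) can differ in size, the sketch below evaluates them for a purely hypothetical common-cause scenario. The scenario, its parameters `c` and `q`, and the function name are assumptions introduced here for illustration and do not come from any experiment.

```python
def components(p1, p2, p3, p12, p13, p23, p123):
    """Return the (non-correlated, correlated) parts of P_sys per equation (8)."""
    non_correlated = p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3
    correlated = ((p12 - p1 * p2) + (p13 - p1 * p3) + (p23 - p2 * p3)
                  - 2 * (p123 - p1 * p2 * p3))
    return non_correlated, correlated

# Hypothetical scenario: with probability c a shared fault makes all three
# versions fail on an input; otherwise each version fails independently
# with probability q.
c, q = 1e-9, 1e-6
p_single = c + (1 - c) * q
p_pair = c + (1 - c) * q**2
p_triple = c + (1 - c) * q**3

non_corr, corr = components(p_single, p_single, p_single,
                            p_pair, p_pair, p_pair, p_triple)
p_sys = 3 * p_pair - 2 * p_triple  # equation (7)
print(non_corr, corr, p_sys)
```

In this assumed scenario the single-version failure probability is barely affected by the shared fault, yet the correlated component is several hundred times larger than the non-correlated one.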

⁷ $3pKT$ is a first-order approximation to the probability that the system fails whenever any one of the 3 versions fails.

If we cannot assume independence, we are back to the original equation (7). Since $P(E_1 \wedge E_2 \wedge E_3) \le P(E_i \wedge E_j)$ for all $i$ and $j$, we have

$$P(E_i \wedge E_j) \le P_{sys} \quad \text{for all } i, j.$$

Clearly, if $P_{sys} < 10^{-9}$ then $P(E_i \wedge E_j) < 10^{-9}$. In other words, in order for $P_{sys}$ to be in the ultrareliable region, the interaction terms (i.e., $P(E_i \wedge E_j)$) must also be in the ultrareliable region. To establish that the system is ultrareliable, the validation must either demonstrate that these terms are very small or establish that $P_{sys}$ is small by some other means (from which we could indirectly deduce that these terms are small). Thus, we are back to the original life-testing problem again.
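
The step from equation (7) to the bound on the pairwise terms can be spelled out by regrouping. The following is a short sketch for the pair $(1, 2)$; the other pairs follow by symmetry.

$$\begin{aligned}
P_{sys} = {} & P(E_1 \wedge E_2) \\
& + [P(E_1 \wedge E_3) - P(E_1 \wedge E_2 \wedge E_3)] \\
& + [P(E_2 \wedge E_3) - P(E_1 \wedge E_2 \wedge E_3)] \\
\ge {} & P(E_1 \wedge E_2)
\end{aligned}$$

Each bracketed difference is non-negative because the event $E_1 \wedge E_2 \wedge E_3$ is contained in both $E_1 \wedge E_3$ and $E_2 \wedge E_3$.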

From the above discussion, it is tempting to conclude that it is necessary to demonstrate that each of the interaction terms is very small in order to establish that $P_{sys}$ is very small. However, this is not a legitimate argument. Although the interaction terms will always be small when $P_{sys}$ is small, one cannot argue that the only way of establishing that $P_{sys}$ is small is by showing that the interaction terms are small. Nevertheless, the likelihood of establishing that $P_{sys}$ is very small without directly establishing that all of the interaction terms are small appears to be extremely remote. This follows from the observation that without further assumptions, there is little more that can be done with equation (7). It seems inescapable that no matter how (7) is manipulated, the terms $P(E_i \wedge E_j)$ will enter in linearly. Unless a form can be found where these terms are eliminated altogether or appear in a non-linear form where they become negligible (e.g., all multiplied by other parameters), the need to estimate them directly will remain. Furthermore, the information contained in these terms must appear somewhere. The dependency of $P_{sys}$ on some formulation of interaction cannot be eliminated.

Although the possibility that a method may be discovered for the validation of software fault-tolerance remains, it is prudent to recognize where this opportunity lies. It does not lie in the realm of controlled experimentation. The only hope is that a reformulation of equation (7) can be discovered that enables the estimation of $P_{sys}$ from a set of parameters which can be estimated using moderate amounts of testing. The efficacy of such a reformulation could be assessed analytically before any experimentation.

5.1.3 Danger of extrapolation to the ultrareliability region

To see the danger in extrapolating from a feasible amount of testing that the different versions are independent, we will consider some possible scenarios for coincident failure processes. Suppose that the probability of failure of a single version during a 1 hour interval is $10^{-5}$. If the versions fail independently, then the probability of a coincident error is on the order of $10^{-10}$. However, suppose in actuality the arrival rate of a coincident error is $10^{-7}$/hour. One could test for 100 years and most likely not see a coincident error. From such experiments it would be tempting to conclude that the different versions are independent. After all, we have tested the system for 100 years and not seen even one coincident error! If we make the independence assumption, the system reliability is $(1 - 3 \times 10^{-10})$. But actually the system reliability is approximately $(1 - 10^{-7})$. Likewise, if the failure rate for a single version were $10^{-4}$/hour and the arrival rate of coincident errors were $10^{-5}$/hour, testing for one year would most likely result in no coincident errors. The erroneous assumption of independence would allow the assignment of a $3 \times 10^{-8}$ probability of failure to the system when in reality the system is no better than $10^{-5}$.
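
The phrase "most likely" in both scenarios can be quantified. The sketch below assumes that coincident errors arrive as a Poisson process at a constant rate, consistent with the exponential model used earlier; the function name and the 8760 hours per year conversion are assumptions.

```python
import math

HOURS_PER_YEAR = 24 * 365

def prob_no_coincident_error(rate_per_hour: float, test_years: float) -> float:
    """P(zero arrivals) for a Poisson process observed for `test_years`."""
    return math.exp(-rate_per_hour * test_years * HOURS_PER_YEAR)

scenario_a = prob_no_coincident_error(1e-7, 100)  # 100 years of testing
scenario_b = prob_no_coincident_error(1e-5, 1)    # 1 year of testing
print(scenario_a, scenario_b)
```

Under these assumptions both scenarios have the same expected number of coincident errors over the test period (0.0876), so in each case the test would end without a single coincident error more than nine times out of ten.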

In conclusion, if independence cannot be assumed, it seems inescapable that the probabilities of the intersections of the events $E_1$, $E_2$, and $E_3$ (i.e., $P(E_i \wedge E_j)$) must be directly measured. As shown above, these occur in the system failure formula not as products, but alone, and thus must be less than $10^{-12}$ per input in order for the system probability of failure to be less than $10^{-9}$ at 1 hour. Unfortunately, testing to this level is infeasible and extrapolation from feasible amounts of testing is dangerous.

Since ultrareliability has been established as a re- 
quirement for many systems, there is great incentive 
to create models which enable an estimate in the ul- 
trareliable region. Thus, there are many examples of 
software reliability models for operational ultrareliable 
systems. Given the ramifications of independence on 
fault-tolerant software reliability quantification, unjus- 
tifiable assumptions must not be overlooked. 

5.2 Feasibility of a General Model For Coincident Errors

Given the limitations imposed by non-independence, 
one possible approach to the ultrareliability quantifica- 
tion problem is to develop a general fault-tolerant soft- 
ware reliability model that accounts for coincident er- 
rors. Two possibilities exist: 

1. The model includes terms which cannot be measured within feasible amounts of time.

2. The model includes only parameters which can be measured within feasible amounts of time.

It is possible to construct elaborate probability models which fall into the first class. Unfortunately, since they depend upon unmeasurable parameters, they are useless for the quantification of ultrareliability. The second case is the only realistic approach.⁸ The independence model is an example of the second case. Models belonging to the second case must explicitly or implicitly express the interaction terms in equation (7) as “known” functions of parameters which can be measured in feasible amounts of time. The known function in the independence model is the zero function, i.e., the interaction terms are zero identically irrespective of any other measurable parameters.

⁸ The first case is included for completeness and because such models have been proposed in the past.

A more general model must provide a mechanism 
that makes these interaction terms negligibly small in 
order to produce a number in the ultrareliable region. 
These known functions must be applicable to all cases of 
multi-version software for which the model is intended. 
Clearly, any estimation based on such a model would 
be strongly dependent upon correct knowledge of these 
functions. But how can these functions be determined? 
There is little hope of deriving them from fundamen- 
tal laws, since the error process occurs in the human 
mind. The only possibility is to derive them from exper- 
imentation, but experimentation can only derive func- 
tions appropriate for low or moderate reliability soft- 
ware. Therefore, the correctness of these functions in 
the ultrareliable region cannot be established experimentally. Justifying the correctness of the known functions requires far more testing than quantifying the reliability of a single ultrareliable system. The model must be shown to be applicable to a specified sample space of multi-version programs. Thus, there must be extensive sampling from the space of multi-version programs,
each of which must undergo life-testing for over 100,000 
years in order to demonstrate the universal applicabil- 
ity of the functions. Thus, in either case, the situation 
appears to be hopeless — the development of a credible 
coincident error model which can be used to estimate 
system reliability within feasible amounts of time is not 
possible. 

5.3 The Coincident-Error Experiments 

Experiments have been performed by several researchers 
to investigate the coincident error process. The first 
and perhaps most famous experiment was performed by 
Knight and Leveson [5]. In this experiment 27 versions 
of a program were produced and subjected to 1,000,000 
input cases. The observed average failure rate per input 
was 0.0007. The major conclusion of the experiment was 
that the independence model was rejected at the 99% 
confidence level. The quantity of coincident errors was 
much greater than that predicted by the independence 
model. Experiments performed by other researchers have 
confirmed the Knight-Leveson conclusion [15, 16]. An 
excellent discussion of the experimental results is given 
in [6]. 

Some debate [6] has occurred over the credibility of 
these experiments. Rather than describe the details of 
this debate, we would prefer to make a few general
observations about the scope and limitations of such
experiments. First, the N-version systems used in these
experiments must have reliabilities in the low to
moderate reliability region. Otherwise, no data would be
obtained which would be relevant to the independence
question.$^{9}$ It is not sufficient (to get data) that the
individual versions are in this reliability region. The
coincident error rate must be observable, so the reliability
of “voted” outputs must be in the low to moderate
reliability region. To see this, consider the following.
Suppose that we have a 3-version system where each
replicate’s failure rate is $10^{-4}$/hour. If they fail
independently, the coincident error rate should be
$3 \times 10^{-8}$/hour. The versions are in the moderate
reliability region, but the system is potentially (i.e. if
independent) in the ultrareliable region. In order to test
for independence, “coincident” errors must be observed.
If the experiment is performed for one year and no
coincident errors are observed, then one can be confident
that the coincident error rate (and consequently the
system failure rate) is less than $1.14 \times 10^{-4}$/hour.
If coincident errors are observed then the coincident
error rate is probably even higher. If the coincident
error rate is actually $10^{-7}$/hour, then the independence
assumption is invalid, but one would have to test for over
1000 years in order to have a reasonable chance to
observe them! Thus, future experiments
will have one of the following results depending on the 
actual reliability of the test specimens: 

1. demonstration that the independence assumption 
does not hold for the low reliability system. 

2. demonstration that the independence assumption 
does hold for the low reliability system. 

3. no coincident errors were seen but the test time 
was insufficient to demonstrate independence for 
the potentially ultrareliable system. 

If the system under test is a low reliability system, the 
independence assumption may be contradicted or vin- 
dicated. Either way, the results will not apply to ul- 
trareliable systems except by way of extrapolation. If 
the system under test were actually ultrareliable, the 
third conclusion would result. Thus, experiments can 
reveal problems with a model such as the independence 
model when the inaccuracies are so severe that they 
manifest themselves in the low or moderate reliability 
region. However, software reliability experiments can 
only demonstrate that an interaction model is inaccu- 
rate, never that a model is accurate for ultrareliable 
software. Thus, negative results are possible, but never 
positive results. 
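
The arithmetic behind the 3-version illustration above can be
sketched in a few lines of Python. This is only an illustration
under stated assumptions: the function names are hypothetical, the
independent coincident rate is approximated as (number of version
pairs) x (version rate)^2, and the bound after a failure-free test
of T hours is taken to be 1/T, which reproduces the
$1.14 \times 10^{-4}$ figure for one year of 8760 hours.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours, leap years ignored


def independent_coincident_rate(version_rate, n_versions=3):
    """Approximate rate (per hour) at which two versions fail together,
    ASSUMING the versions fail independently: pairs * rate^2."""
    return math.comb(n_versions, 2) * version_rate ** 2


def rate_bound_after_clean_test(test_hours):
    """Rough upper bound on the failure rate after a test with no
    observed failures, taken here as 1/T (an assumption)."""
    return 1.0 / test_hours


def years_to_expect_one_event(rate_per_hour):
    """Test duration, in years, whose expected event count is one."""
    return 1.0 / rate_per_hour / HOURS_PER_YEAR


if __name__ == "__main__":
    print("independent coincident rate:", independent_coincident_rate(1e-4))
    print("bound after one clean year :", rate_bound_after_clean_test(HOURS_PER_YEAR))
    print("years to see a 1e-7/h event:", years_to_expect_one_event(1e-7))
```

The gap between the one-year bound and the $3 \times 10^{-8}$/hour
rate predicted under independence is the point of the illustration:
a feasible test cannot distinguish the two.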

9 That is, unless one were willing to carry out a
“Smithsonian” experiment, i.e. one which requires
centuries to complete.


The experiments performed by Knight and Leveson 
and others have been useful in alerting the world to 
a formerly unnoticed critical assumption. However, it 
is important to realize that these experiments cannot 
accomplish what is really needed — namely, to establish 
with scientific rigor that a particular design is ultra- 
reliable or that a particular design methodology pro- 
duces ultrareliable systems. This leaves us in a terrible 
bind. We want to use digital processors in life-critical 
applications, but we have no feasible way of establishing 
that they meet their ultrareliability requirements. We 
must either change the reliability requirements to a level 
which is in the low to moderate reliability region or give 
up the notion of experimental quantification. Neither 
option is very appealing. 

6 Conclusions 

In recent years, computer systems have been introduced 
into life-critical situations where previously caution had 
precluded their use. Despite alarming incidents of dis- 
aster already occurring with increasing frequency, in- 
dustry in the United States and abroad continues to ex- 
pand the use of digital computers to monitor and control 
complex real-time physical processes and mechanical de- 
vices. The potential performance advantages of using 
computers over their analog predecessors have created 
an atmosphere where serious safety concerns about digi- 
tal hardware and software are not adequately addressed. 
Although fault-tolerance research has discovered effec- 
tive techniques to protect systems from physical compo- 
nent failure, practical methods to prevent design errors 
have not been found. Without a major change in the 
design and verification methods used for life-critical sys- 
tems, major disasters are almost certain to occur with 
increasing frequency. 

Since life-testing of ultrareliable software is infeasible
(i.e., to quantify a $10^{-8}$/hour failure rate requires
more than $10^{8}$ hours of testing), reliability models of
fault-tolerant software have been developed from which
ultrareliable-system estimates can be obtained. The key
assumption which enables an ultrareliability prediction
for hardware failures is that the electrically isolated
processors fail independently. This assumption is
reasonable for hardware component failures, but not
provable or testable. This assumption is not reasonable
for software or hardware design flaws. Furthermore, any
model which tries to include some level of non-independent
interaction between the multiple versions cannot be
justified experimentally. It would take more than $10^{8}$
hours of testing to make sure there are not coincident
errors in two or more versions which appear rarely but
frequently enough to degrade the system reliability below
$(1 - 10^{-8})$.

Some significant conclusions can be drawn from the 
observations of this paper. Since digital computers will 
inevitably be used in life-critical applications, it is nec- 
essary that “credible” methods be developed for gen- 
erating reliable software. Nevertheless, what consti- 
tutes a “credible” method must be carefully reconsid- 
ered. A pervasive view is that software validation must 
be accomplished by probabilistic and statistical meth- 
ods. The shortcomings and pitfalls of this view have 
been expounded in this paper. Based on intuitive mer- 
its, it is likely that software fault tolerance will be used 
in life-critical applications. Nevertheless, the ability of 
this approach to generate ultrareliable software cannot 
be demonstrated by research experiments. The question 
of whether software fault tolerance is more effective than 
other design methodologies such as formal verification 
or vice versa can only be answered for low or moder- 
ate reliability systems, not for ultrareliable applications. 
The choice between software fault tolerance and formal 
verification must necessarily be based on either extrap- 
olation or nonexperimental reasoning. 

Similarly, experiments designed to compare the accu- 
racy of different types of software reliability models can 
only be accomplished in the low to moderate reliability 
regions. There is little reason to believe that a model 
which is accurate in the moderate region is accurate in 
the ultrareliable region. It is possible that models which 
are inferior to other models in the moderate region are 
superior in the ultrareliable region — again, this cannot 
be demonstrated. 


Appendix

In this section, the statistics of life testing will be briefly
reviewed. A more detailed presentation can be found in a
standard statistics textbook such as Mann-Schafer-Singpurwalla
[11]. This section presents a statistical test based on the
maximum likelihood ratio$^{10}$ and was produced using reference
[11] extensively. The mathematical relationship between the
number of test specimens, specimen reliability, and expected
time on test is explored.

Let
    $n$ = the number of test specimens
    $r$ = the observed number of specimen failures
    $X_1 < X_2 < \ldots < X_r$ = the ordered failure times

A hypothesis test is constructed to test the reliability of
the system against an alternative.

$$ H_0: \text{Reliability} = R_0 $$
$$ H_1: \text{Reliability} < R_0 $$

The null hypothesis covers the case where the system is
ultrareliable. The alternative covers the case where the
system fails to meet the reliability requirement. The $\alpha$
error is the probability of rejecting the null hypothesis
when it is true (i.e. producer’s risk). The $\beta$ error is the
probability of accepting the null hypothesis when it is
false (i.e. consumer’s risk).

There are two basic experimental approaches: (1) testing
with replacement and (2) testing without replacement. In
either case, one places $n$ items on test. The test is
finished when $r$ failures have been observed. In the first
case, when a device fails a new device is put on test in its
place. In the second case, a failed device is not replaced.
The tester chooses values of $n$ and $r$ to obtain the desired
levels of the $\alpha$ and $\beta$ errors. In general, the larger $r$
and $n$ are, the smaller the statistical testing errors are.

It is necessary to assume some distribution for the
time-to-failure of the test specimen. For simplicity, we
will assume that the distribution is exponential.$^{11}$ The
test then can be reduced to a test on exponential means,
using the transformation:

$$ \mu = \frac{t}{-\ln[R(t)]} $$

The expected time on test can then be calculated as a
function of $r$ and $n$. The expected time on test, $D_t$, for
the replacement case is:

$$ D_t = \mu_0 \frac{r}{n} \tag{9} $$

where $\mu_0$ is the mean time to failure of the test specimen.
The expected time on test for the non-replacement case is:

$$ D_t = \mu_0 \sum_{j=1}^{r} \frac{1}{n - j + 1} \tag{10} $$

In order to calculate the $\alpha$ and $\beta$ errors, a specific
value of the alternative mean must be selected. Thus,
the hypothesis test becomes:

$$ H_0: \mu = \mu_0 $$
$$ H_1: \mu = \mu_a $$

A reasonable alternative hypothesis is that the reliability
at 10 hours is $1 - 10^{-8}$ or that $\mu_a = 10^{9}$. The test
statistic $T_r$ is given by

$$ T_r = (n - r) X_r + \sum_{i=1}^{r} X_i $$

for the non-replacement case and

$$ T_r = n X_r $$

for the “replacement case”. The critical value $T_c$ (for
which the null hypothesis should be rejected whenever
$T_r < T_c$) can be determined as a function of $\alpha$ and $r$:

$$ T_c = \mu_0 \frac{\chi^2_{2r,\alpha}}{2} $$

where $\chi^2_{\nu,\alpha}$ is the $\alpha$ percentile of the chi-square
distribution with $\nu$ degrees of freedom. Given a choice of $r$
and $\alpha$ the value of the “best” critical region is determined
by this formula. The $\beta$ error can be calculated from

$$ T_c = \mu_a \frac{\chi^2_{2r,1-\beta}}{2} $$

Neither of the above equations can be solved until $r$
is determined. However, the following formula can be
derived from them:

$$ \frac{\chi^2_{2r,\alpha}}{\chi^2_{2r,1-\beta}} = \frac{\mu_a}{\mu_0} \tag{11} $$

Given the desired $\alpha$ and $\beta$ errors, one chooses the
smallest $r$ which satisfies this equation.

10 The maximum likelihood ratio test is the test which
provides the “best” critical region for a given $\alpha$ error.

11 If the failure times follow a Weibull distribution with
known shape parameter, the data can be transformed into
variables having exponential distributions before the test is
applied.


Example 1

Suppose that we wish to test:

$$ H_0: \mu_0 = 10^{10} $$
$$ H_1: \mu_a = 10^{9} $$

For $\alpha = 0.05$ and $\beta = 0.01$, the smallest $r$ satisfying
equation (11) is 3 (using a chi-square table). Thus, the
critical region is $\mu_0 \chi^2_{6,\alpha}/2 = 10^{10}(1.635)/2 = 8.18 \times 10^{9}$.
The experimenter can choose any value of $n$ greater than
$r$. The larger $n$ is, the shorter the expected time on test
is. For the replacement case, the expected time on test is:

    no. of replicates ($n$)    Expected Test Duration $D_t$
    10                         $3 \times 10^{9}$ hours
    100                        $3 \times 10^{8}$ hours
    10000                      $3 \times 10^{6}$ hours

Even with 10000 test specimens, the expected test time
is 342 years.
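
The table above can be reproduced from the expected-time-on-test
expressions (9) and (10), taken here as $D_t = \mu_0 r / n$ for the
replacement case and $D_t = \mu_0 \sum_{j=1}^{r} 1/(n-j+1)$ for the
non-replacement case. The following is a minimal sketch; the
function names are hypothetical, and a year is assumed to be
8760 hours.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: 8760 hours per year


def expected_time_replacement(mu0, r, n):
    """Expected time on test, replacement case: mu0 * r / n."""
    return mu0 * r / n


def expected_time_no_replacement(mu0, r, n):
    """Expected time on test, non-replacement case:
    mu0 * sum_{j=1..r} 1 / (n - j + 1)."""
    return mu0 * sum(1.0 / (n - j + 1) for j in range(1, r + 1))


if __name__ == "__main__":
    mu0, r = 1e10, 3  # values from Example 1
    for n in (10, 100, 10000):
        hours = expected_time_replacement(mu0, r, n)
        print(n, hours, "hours =", hours / HOURS_PER_YEAR, "years")
```

Because each failed specimen is not replaced, the non-replacement
duration is always slightly longer than the replacement duration
for the same $n$ and $r$ (when $r > 1$).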


Example 2

Suppose that we wish to test:

$$ H_0: \mu_0 = 10^{10} $$
$$ H_1: \mu_a = 10^{9} $$

Given $\alpha = 0.05$ and $r = 1$, the $\beta$ error can be calculated.
First the critical region is $\mu_0 \chi^2_{2,\alpha}/2 = 10^{10}[0.1026]/2 = 5.13 \times 10^{8}$.
From a chi-square table, the $\beta$ error can be
seen to be greater than 0.50.
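
For $r = 1$ the table lookup can be cross-checked in closed form,
because a chi-square variable with 2 degrees of freedom is
exponentially distributed with mean 2. The following worked step
is a sketch under the appendix's exponential time-to-failure
assumption, using only the symbols already defined:

$$ \chi^2_{2,\alpha} = -2\ln(1 - \alpha) = -2\ln(0.95) \approx 0.1026, \qquad T_c = \mu_0 \frac{\chi^2_{2,\alpha}}{2} \approx 5.13 \times 10^{8} $$

$$ \beta = P(T_1 \ge T_c \mid \mu = \mu_a) = e^{-T_c/\mu_a} = (1 - \alpha)^{\mu_0/\mu_a} = 0.95^{10} \approx 0.60 $$

This is consistent with the statement that the $\beta$ error exceeds 0.50.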


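Illustrative Table

For $\mu_0 = 10^{10}$ and $\mu_a = 10^{9}$, the following relationship
exists between $\alpha$, $r$, and $\beta$:

    $\alpha$    $r$    $\beta$
    .01         5      $\approx$ .005
    .01         3      $\approx$ .20
    .01         2      $\approx$ .50
    .05         3      $\approx$ .02
    .05         2      $\approx$ .10
    .05         1      $\approx$ .50
    .10         3      $\approx$ .005
    .10         2      $\approx$ .03
    .10         1      $\approx$ .25
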
The power of the test $1 - \beta$ changes drastically with
changes in $r$. Clearly $r$ must be at least 2 to have a
reasonable value for the $\beta$ error.
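
The relationship between $\alpha$, $r$, and $\beta$ can also be
computed numerically rather than read from a chi-square table.
The sketch below is an illustration, not the procedure used to
build the table: the function names are hypothetical, it relies on
the closed-form chi-square CDF for even degrees of freedom ($2r$),
and it finds percentiles by bisection. The computed values need
not coincide with the rounded table entries.

```python
import math


def chi2_cdf_even(x, r):
    """CDF of a chi-square variable with 2r degrees of freedom:
    1 - exp(-x/2) * sum_{k=0}^{r-1} (x/2)^k / k!."""
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, r):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_percentile_even(p, r):
    """Value x with CDF(x) = p (0 < p < 1), found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, r) < p:  # grow the bracket until it contains p
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, r) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def critical_value(mu0, alpha, r):
    """T_c = mu0 * chi2_{2r,alpha} / 2."""
    return mu0 * chi2_percentile_even(alpha, r) / 2.0


def beta_error(mu0, mua, alpha, r):
    """Solve T_c = mua * chi2_{2r,1-beta} / 2 for beta."""
    t_c = critical_value(mu0, alpha, r)
    return 1.0 - chi2_cdf_even(2.0 * t_c / mua, r)


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.10):
        for r in (1, 2, 3, 5):
            print(alpha, r, beta_error(1e10, 1e9, alpha, r))
```

Restricting to even degrees of freedom is sufficient here because
the test statistic always involves $2r$ degrees of freedom, and it
keeps the sketch free of third-party libraries.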